Entropy-based automated wrapper generation for weblog data extraction
Similar resources
Self-supervised Automated Wrapper Generation for Weblog Data Extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extracti...
Entropy-based Consensus for Distributed Data Clustering
The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...
Example-Based Wrapper Generation
Extracting specific information from the vast amount of documents in the World Wide Web is a very tedious task. Manual extraction has high quality output but cannot be automated. Programmed wrappers, on the other hand, suffer from the uncertainty of document structures. The generation of a more generic wrapper for whole classes of textual information, which can accommodate all kinds of document...
Wrapper Generation for Web Accessible Data Sources
There is an increase in the number of data sources that can be queried across the WWW. Such sources typically support HTML forms-based interfaces and search engines query collections of suitably indexed data. The data is displayed via a browser. One drawback to these sources is that there is no standard programming interface suitable for applications to submit queries. Second, the output (answe...
Type-rule-based Wrapper Generation
Biological data sources are useful for bioinformatics research. Several computational tools have been developed so that these data sources can be used as easily as possible. Most biological data is provided over the web. Web data is mostly represented in an unstructured format and cannot be queried using traditional query languages. Furthermore, the problems, which integration of biolo...
Journal
Journal title: World Wide Web
Year: 2013
ISSN: 1386-145X,1573-1413
DOI: 10.1007/s11280-013-0269-6